{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Preprocessing\n",
    "\n",
    "For any NLP tasks in Deep Learning the first step would be preprocessing the text data into numbers!\n",
    "\n",
    "In the recent years almost all the DL packages have started to provide their own APIs to do the text preprocessing, however each one has its own subtle differences, which if not understood correctly will lead to improper data preparation and thus skewing model trianing.\n",
    "\n",
    "When I resumed my hobby in DL with Transformers + Tensorflow 2.0, I came across different APIs doing the same text tokneization as part of the Tensorflow ecosystem tutorials.\n",
    "\n",
    "From the days of writing our own tokenizer and encoders/decoders, we now have APIs which can simplify our work a lot. However care should be taken while using such APIs, like \n",
    "- How you wanted the text to be splitted?\n",
    "- How the tokenizers wanted to handle the punctuations/special characters?\n",
    "- How to handle out of vocab word (OOV)?\n",
    "- Do you wanted to use [WordPiece tokenization](https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble/55416944#55416944)?\n",
    "- Does the tokenizer/enoder support charcter level encoding ?\n",
    "- How is vocab length is calculated? does it include PAD and OOV words in it?\n",
    "\n",
    "Choosing the right API to do our task with multiple options out there is not an easy job, as each API is build with specific purpose to fit with its counter parts. Some wors natively with Tensors, somw with Tensrflow datasets, some with character level etc.,\n",
    "\n",
    "This is a quick skim through reference blog for word and character level encoding in Tensorflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from string import punctuation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import tensorflow_text\n",
    "import tensorflow_datasets as tfds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data is a sample from [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_data = [\"4. Kurt Betschart - Bruno Risi ( Switzerland ) 22\",\n",
    "            \"Israel approves Arafat 's flight to West Bank .\",\n",
    "            \"Moreau takes bronze medal as faster losing semifinalist .\",\n",
    "            \"W D L G / F G / A P\",\n",
    "            \"-- Helsinki newsroom +358 - 0 - 680 50 248\",\n",
    "            \"M'bishi Gas sets terms on 7-year straight .\"]\n",
    "ner_data = [\"O B-PER I-PER O B-PER I-PER O B-LOC O O\",\n",
    "            \"B-LOC O B-PER O O O B-LOC I-LOC O\",\n",
    "            \"B-PER O O O O O O O O\",\n",
    "            \"O O O O O O O O O O\",\n",
    "            \"O B-LOC O O O O O O O O\",\n",
    "            \"B-ORG I-ORG O O O O O O\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_word, end_word, unknown_word = \"<START>\", \"<END>\", \"<UNK>\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Three set of APIs are explored\n",
    "- Tensorflow Dataset APIs\n",
    "- Tensorflow Keras Text Preprocessing\n",
    "- Tensorflow Text\n",
    "\n",
    "For my current task Keras APIs solved my requirements, i.e word and character tokenizing, encoding and decoding.\n",
    "\n",
    "Note: Decoding will be updated if I get time.:)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Tensorflow Dataset API\n",
    "Like many I started the TRansformers from this tutorial which uses the Tensorflow Dataset APIs.\n",
    "https://www.tensorflow.org/tutorials/text/transformer\n",
    "\n",
    "- The API is clean and easy to use.\n",
    "- https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/Tokenizer\n",
    "- https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/TextEncoder\n",
    "- Here we need Tokenizer and Encoder seprately.\n",
    "\n",
    "Cons:\n",
    "- For the task of preparing the text for NER, we have to consider all special characters, which by default is ignored.\n",
    "- Even if we add the `punctuation` as reserved tokens, it still removes the special characters while tokenizing\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_tokenizer = tfds.features.text.Tokenizer(reserved_tokens=[start_word, end_word] + list(punctuation))\n",
    "tags_tokenizer = tfds.features.text.Tokenizer(reserved_tokens=['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC',\n",
    "                                                               'I-MISC', 'I-ORG', 'I-PER', 'O',\n",
    "                                                               start_word, end_word])\n",
    "text_vocabulary_set = set()\n",
    "tags_vocabulary_set = set()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "for text, ner in zip(text_data, ner_data):\n",
    "    text_tokens = text_tokenizer.get_keras_tokenizer(text)\n",
    "    tag_tokens = tags_tokenizer.get_keras_tokenizer(ner)\n",
    "    \n",
    "    text_vocabulary_set.update(text_tokens)\n",
    "    tags_vocabulary_set.update(tag_tokens)\n",
    "    \n",
    "text_vocabulary_set.update([start_word, end_word])\n",
    "tags_vocabulary_set.update([start_word, end_word])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_encoder = tfds.features.text.TokenTextEncoder(text_vocabulary_set, oov_token=unknown_word, tokenizer=text_tokenizer)\n",
    "tags_encoder = tfds.features.text.TokenTextEncoder(tags_vocabulary_set, oov_token=unknown_word, tokenizer=tags_tokenizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Switzerland ---> 1\n",
      "L ---> 2\n",
      "4 ---> 3\n",
      "Betschart ---> 4\n",
      "s ---> 5\n",
      "West ---> 6\n",
      "7 ---> 7\n",
      "D ---> 8\n",
      "<END> ---> 9\n",
      "Bank ---> 10\n",
      "Gas ---> 11\n",
      "on ---> 12\n",
      "Bruno ---> 13\n",
      "newsroom ---> 14\n",
      "year ---> 15\n",
      "sets ---> 16\n",
      ". ---> 17\n",
      "248 ---> 18\n",
      "' ---> 19\n",
      "+ ---> 20\n",
      "approves ---> 21\n",
      "Risi ---> 22\n",
      "Helsinki ---> 23\n",
      "<START> ---> 24\n",
      "terms ---> 25\n",
      "- ---> 26\n",
      "Israel ---> 27\n",
      "bishi ---> 28\n",
      "50 ---> 29\n",
      "Moreau ---> 30\n",
      "losing ---> 31\n",
      "to ---> 32\n",
      "( ---> 33\n",
      "straight ---> 34\n",
      "bronze ---> 35\n",
      "medal ---> 36\n",
      "/ ---> 37\n",
      "faster ---> 38\n",
      "22 ---> 39\n",
      "takes ---> 40\n",
      "G ---> 41\n",
      "as ---> 42\n",
      "semifinalist ---> 43\n",
      "F ---> 44\n",
      "P ---> 45\n",
      "358 ---> 46\n",
      "680 ---> 47\n",
      "Arafat ---> 48\n",
      "M ---> 49\n",
      "0 ---> 50\n",
      ") ---> 51\n",
      "flight ---> 52\n",
      "A ---> 53\n",
      "Kurt ---> 54\n",
      "W ---> 55\n"
     ]
    }
   ],
   "source": [
    "for token, id in text_encoder._token_to_id.items():\n",
    "    print(token,\"--->\", id+1) # Be default 0 is used PAD index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I-ORG ---> 1\n",
      "<START> ---> 2\n",
      "B-ORG ---> 3\n",
      "O ---> 4\n",
      "B-PER ---> 5\n",
      "<END> ---> 6\n",
      "I-PER ---> 7\n",
      "B-LOC ---> 8\n",
      "I-LOC ---> 9\n"
     ]
    }
   ],
   "source": [
    "for token, id in tags_encoder._token_to_id.items():\n",
    "    print(token, \"--->\", id+1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tags_encoder.vocab_size # i.e above tags + PAD + UNK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'4. Kurt Betschart - Bruno Risi ( Switzerland ) 22'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_data[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'O B-PER I-PER O B-PER I-PER O B-LOC O O'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ner_data[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[3, 17, 54, 4, 26, 13, 22, 33, 1, 51, 39]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = text_encoder.encode(text_data[0])\n",
    "res"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4 O 3\n",
      ". B-PER 17\n",
      "Kurt I-PER 54\n",
      "Betschart O 4\n",
      "- B-PER 26\n",
      "Bruno I-PER 13\n",
      "Risi O 22\n",
      "( B-LOC 33\n",
      "Switzerland O 1\n",
      ") O 51\n"
     ]
    }
   ],
   "source": [
    "for text_token, tag_token, id in zip(text_tokenizer.get_keras_tokenizer(text_data[0]), ner_data[0].split(\" \"), res):\n",
    "    print(text_token, tag_token, id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**As you can see \"4.\" is splitted into \"4\" and \".\"**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Keras API\n",
    "- If you are lucky enough and had patient to read this tutorial https://www.tensorflow.org/tutorials/text/nmt_with_attention or who loves Keras, then your requirement for Text preprocessing is met.\n",
    "\n",
    "- https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer?version=stable\n",
    "\n",
    "## Word Encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def keras_tokenize(text_corpus, char_level=False, filters='!\"#$%&()*+,-./:;<=>?@[\\\\]^_`{|}~\\t\\n'):\n",
    "    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token=\"<UNK>\", char_level=char_level, lower=False)\n",
    "    lang_tokenizer.fit_on_texts(text_corpus)\n",
    "    return lang_tokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_word_tokenizer = keras_tokenize(text_data, filters='')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Lets print word index for our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{1: '<UNK>',\n",
       " 2: '-',\n",
       " 3: '.',\n",
       " 4: 'G',\n",
       " 5: '/',\n",
       " 6: '4.',\n",
       " 7: 'Kurt',\n",
       " 8: 'Betschart',\n",
       " 9: 'Bruno',\n",
       " 10: 'Risi',\n",
       " 11: '(',\n",
       " 12: 'Switzerland',\n",
       " 13: ')',\n",
       " 14: '22',\n",
       " 15: 'Israel',\n",
       " 16: 'approves',\n",
       " 17: 'Arafat',\n",
       " 18: \"'s\",\n",
       " 19: 'flight',\n",
       " 20: 'to',\n",
       " 21: 'West',\n",
       " 22: 'Bank',\n",
       " 23: 'Moreau',\n",
       " 24: 'takes',\n",
       " 25: 'bronze',\n",
       " 26: 'medal',\n",
       " 27: 'as',\n",
       " 28: 'faster',\n",
       " 29: 'losing',\n",
       " 30: 'semifinalist',\n",
       " 31: 'W',\n",
       " 32: 'D',\n",
       " 33: 'L',\n",
       " 34: 'F',\n",
       " 35: 'A',\n",
       " 36: 'P',\n",
       " 37: '--',\n",
       " 38: 'Helsinki',\n",
       " 39: 'newsroom',\n",
       " 40: '+358',\n",
       " 41: '0',\n",
       " 42: '680',\n",
       " 43: '50',\n",
       " 44: '248',\n",
       " 45: \"M'bishi\",\n",
       " 46: 'Gas',\n",
       " 47: 'sets',\n",
       " 48: 'terms',\n",
       " 49: 'on',\n",
       " 50: '7-year',\n",
       " 51: 'straight'}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_word_tokenizer.index_word"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, if you wanna to convert your text data into intergers.."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 6,  7,  8,  2,  9, 10, 11, 12, 13, 14],\n",
       "       [15, 16, 17, 18, 19, 20, 21, 22,  3,  0],\n",
       "       [23, 24, 25, 26, 27, 28, 29, 30,  3,  0],\n",
       "       [31, 32, 33,  4,  5, 34,  4,  5, 35, 36],\n",
       "       [37, 38, 39, 40,  2, 41,  2, 42, 43, 44],\n",
       "       [45, 46, 47, 48, 49, 50, 51,  3,  0,  0]], dtype=int32)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = text_word_tokenizer.texts_to_sequences(text_data)\n",
    "# Easy to use padding API from Keras\n",
    "res = tf.keras.preprocessing.sequence.pad_sequences(res, padding='post')\n",
    "res"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To test how out of vocab index values are used, we can feed tags to text tokenizer ;) and see `1s` "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
       " [1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
       " [1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
       " [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
       " [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
       " [1, 1, 1, 1, 1, 1, 1, 1]]"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_word_tokenizer.texts_to_sequences(ner_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "ner_word_tokenizer = keras_tokenize(ner_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[2, 3, 4, 6, 4, 2, 3, 4, 6, 4, 2, 3, 5, 2, 2],\n",
       "       [3, 5, 2, 3, 4, 2, 2, 2, 3, 5, 6, 5, 2, 0, 0],\n",
       "       [3, 4, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0],\n",
       "       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0],\n",
       "       [2, 3, 5, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0],\n",
       "       [3, 7, 6, 7, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0]], dtype=int32)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = ner_word_tokenizer.texts_to_sequences(ner_data)\n",
    "res = tf.keras.preprocessing.sequence.pad_sequences(res, padding=\"post\")\n",
    "res"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{1: '<UNK>', 2: 'O', 3: 'B', 4: 'PER', 5: 'LOC', 6: 'I', 7: 'ORG'}"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ner_word_tokenizer.index_word"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab_size = len(ner_word_tokenizer.word_index) + 1\n",
    "vocab_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Character Encoding\n",
    "\n",
    "Character level encoding will be useful when you wanted to capture the sematics at word level which can help you to deal with out of vocab words, deeper understanding of words with repect to its position etc.,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_char_tonkenizer = keras_tokenize(text_data, char_level=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'4. Kurt Betschart - Bruno Risi ( Switzerland ) 22'"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_data[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{1: '<UNK>',\n",
       " 2: ' ',\n",
       " 3: 's',\n",
       " 4: 'e',\n",
       " 5: 'a',\n",
       " 6: 't',\n",
       " 7: 'r',\n",
       " 8: 'i',\n",
       " 9: 'n',\n",
       " 10: 'o',\n",
       " 11: 'l',\n",
       " 12: '-',\n",
       " 13: '.',\n",
       " 14: 'h',\n",
       " 15: 'f',\n",
       " 16: 'm',\n",
       " 17: 'u',\n",
       " 18: 'B',\n",
       " 19: '2',\n",
       " 20: 'g',\n",
       " 21: 'k',\n",
       " 22: 'G',\n",
       " 23: '8',\n",
       " 24: '0',\n",
       " 25: '4',\n",
       " 26: 'w',\n",
       " 27: 'z',\n",
       " 28: 'd',\n",
       " 29: 'p',\n",
       " 30: 'A',\n",
       " 31: \"'\",\n",
       " 32: 'W',\n",
       " 33: 'M',\n",
       " 34: 'b',\n",
       " 35: '/',\n",
       " 36: '5',\n",
       " 37: 'K',\n",
       " 38: 'c',\n",
       " 39: 'R',\n",
       " 40: '(',\n",
       " 41: 'S',\n",
       " 42: ')',\n",
       " 43: 'I',\n",
       " 44: 'v',\n",
       " 45: 'D',\n",
       " 46: 'L',\n",
       " 47: 'F',\n",
       " 48: 'P',\n",
       " 49: 'H',\n",
       " 50: '+',\n",
       " 51: '3',\n",
       " 52: '6',\n",
       " 53: '7',\n",
       " 54: 'y'}"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_char_tonkenizer.index_word\n",
    "# care needs to be taken while using cahracter level tokenizing with OOV, \n",
    "# since all the characters will be part of our vocab. This can happen when we wanted to \n",
    "# get_keras_tokenizer a differnt language or different string encoding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['4.', 'Kurt', 'Betschart', '-', 'Bruno', 'Risi', '(', 'Switzerland', ')', '22'], ['Israel', 'approves', 'Arafat', \"'s\", 'flight', 'to', 'West', 'Bank', '.'], ['Moreau', 'takes', 'bronze', 'medal', 'as', 'faster', 'losing', 'semifinalist', '.'], ['W', 'D', 'L', 'G', '/', 'F', 'G', '/', 'A', 'P'], ['--', 'Helsinki', 'newsroom', '+358', '-', '0', '-', '680', '50', '248'], [\"M'bishi\", 'Gas', 'sets', 'terms', 'on', '7-year', 'straight', '.']]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[array([[25, 13,  0,  0,  0,  0],\n",
       "        [37, 17,  7,  6,  0,  0],\n",
       "        [ 3, 38, 14,  5,  7,  6],\n",
       "        [12,  0,  0,  0,  0,  0],\n",
       "        [18,  7, 17,  9, 10,  0],\n",
       "        [39,  8,  3,  8,  0,  0],\n",
       "        [40,  0,  0,  0,  0,  0],\n",
       "        [ 4,  7, 11,  5,  9, 28],\n",
       "        [42,  0,  0,  0,  0,  0],\n",
       "        [19, 19,  0,  0,  0,  0]], dtype=int32),\n",
       " array([[43,  3,  7,  5,  4, 11],\n",
       "        [29,  7, 10, 44,  4,  3],\n",
       "        [30,  7,  5, 15,  5,  6],\n",
       "        [31,  3,  0,  0,  0,  0],\n",
       "        [15, 11,  8, 20, 14,  6],\n",
       "        [ 6, 10,  0,  0,  0,  0],\n",
       "        [32,  4,  3,  6,  0,  0],\n",
       "        [18,  5,  9, 21,  0,  0],\n",
       "        [13,  0,  0,  0,  0,  0]], dtype=int32),\n",
       " array([[33, 10,  7,  4,  5, 17],\n",
       "        [ 6,  5, 21,  4,  3,  0],\n",
       "        [34,  7, 10,  9, 27,  4],\n",
       "        [16,  4, 28,  5, 11,  0],\n",
       "        [ 5,  3,  0,  0,  0,  0],\n",
       "        [15,  5,  3,  6,  4,  7],\n",
       "        [11, 10,  3,  8,  9, 20],\n",
       "        [ 9,  5, 11,  8,  3,  6],\n",
       "        [13,  0,  0,  0,  0,  0]], dtype=int32),\n",
       " array([[32,  0,  0,  0,  0,  0],\n",
       "        [45,  0,  0,  0,  0,  0],\n",
       "        [46,  0,  0,  0,  0,  0],\n",
       "        [22,  0,  0,  0,  0,  0],\n",
       "        [35,  0,  0,  0,  0,  0],\n",
       "        [47,  0,  0,  0,  0,  0],\n",
       "        [22,  0,  0,  0,  0,  0],\n",
       "        [35,  0,  0,  0,  0,  0],\n",
       "        [30,  0,  0,  0,  0,  0],\n",
       "        [48,  0,  0,  0,  0,  0]], dtype=int32),\n",
       " array([[12, 12,  0,  0,  0,  0],\n",
       "        [11,  3,  8,  9, 21,  8],\n",
       "        [26,  3,  7, 10, 10, 16],\n",
       "        [50, 51, 36, 23,  0,  0],\n",
       "        [12,  0,  0,  0,  0,  0],\n",
       "        [24,  0,  0,  0,  0,  0],\n",
       "        [12,  0,  0,  0,  0,  0],\n",
       "        [52, 23, 24,  0,  0,  0],\n",
       "        [36, 24,  0,  0,  0,  0],\n",
       "        [19, 25, 23,  0,  0,  0]], dtype=int32),\n",
       " array([[31, 34,  8,  3, 14,  8],\n",
       "        [22,  5,  3,  0,  0,  0],\n",
       "        [ 3,  4,  6,  3,  0,  0],\n",
       "        [ 6,  4,  7, 16,  3,  0],\n",
       "        [10,  9,  0,  0,  0,  0],\n",
       "        [53, 12, 54,  4,  5,  7],\n",
       "        [ 7,  5,  8, 20, 14,  6],\n",
       "        [13,  0,  0,  0,  0,  0]], dtype=int32)]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# split the text by spaces i.e list of list of words\n",
    "char_data = [text.split(\" \") for text in text_data]\n",
    "print(char_data)\n",
    "\n",
    "char_data_encoded = []\n",
    "for char_seq in char_data:\n",
    "    # get_keras_tokenizer each sentence\n",
    "    res = text_char_tonkenizer.texts_to_sequences(char_seq)\n",
    "    # pad it \n",
    "    res = tf.keras.preprocessing.sequence.pad_sequences(res, padding=\"post\", maxlen=6)\n",
    "    # group it as a batch\n",
    "    char_data_encoded.append(res)\n",
    "    \n",
    "char_data_encoded"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TF Text APIs\n",
    "- https://github.com/tensorflow/text\n",
    "- https://www.tensorflow.org/tutorials/tensorflow_text/intro\n",
    "- https://blog.tensorflow.org/2019/06/introducing-tftext.html\n",
    "\n",
    "The last one is Tensorflow Text APIs. From first glance it seems to have good integration with the Tensorflow Dataset APIs and Keras.\n",
    "\n",
    "Since my current requirements are met with Keras preprocessing APIs, I am keepin theis for later time exploration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /home/mageswarand/anaconda3/envs/tie/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py:180: batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.\n",
      "Instructions for updating:\n",
      "`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.\n",
      "[[b'everything', b'not', b'saved', b'will', b'be', b'lost.'], [b'Sad\\xe2\\x98\\xb9']]\n"
     ]
    }
   ],
   "source": [
    "tokenizer = tensorflow_text.WhitespaceTokenizer()\n",
    "tokens = tokenizer.get_keras_tokenizer(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\n",
    "print(tokens.to_list())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_tokens = tokenizer.get_keras_tokenizer(text_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: id=1036, shape=(56,), dtype=string, numpy=\n",
       "array([b'4.', b'Kurt', b'Betschart', b'-', b'Bruno', b'Risi', b'(',\n",
       "       b'Switzerland', b')', b'22', b'Israel', b'approves', b'Arafat',\n",
       "       b\"'s\", b'flight', b'to', b'West', b'Bank', b'.', b'Moreau',\n",
       "       b'takes', b'bronze', b'medal', b'as', b'faster', b'losing',\n",
       "       b'semifinalist', b'.', b'W', b'D', b'L', b'G', b'/', b'F', b'G',\n",
       "       b'/', b'A', b'P', b'--', b'Helsinki', b'newsroom', b'+358', b'-',\n",
       "       b'0', b'-', b'680', b'50', b'248', b\"M'bishi\", b'Gas', b'sets',\n",
       "       b'terms', b'on', b'7-year', b'straight', b'.'], dtype=object)>"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_tokens.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
